## What do you expect to learn? Where did you hear about the course?
I hope to learn mostly modeling and clustering with R. I think this course also gives me the opportunity to refresh my memory on working with R and its functions. I heard about this course from our quantitative studies teacher Arho Toikka, who highly recommended it to me. Since I was already very interested in data science, this seemed like the perfect opportunity to deepen my knowledge of the subject.
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
install.packages("GGally", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
analysisData <- read.table('data/learning2014')
summary(analysisData$Points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
This data was collected from social science students at the University of Helsinki. The survey was conducted in a statistics course for bachelor's students in the social sciences between December 2014 and January 2015. It consists of 7 variables describing gender (factored as female or male), age, attitude (towards the study of statistics), deep learning (the pupils' own attitudes and practices towards deep learning), surface learning and strategic learning, as well as the points received in the final exam of the course. The learning scales mainly try to capture what kinds of strategies and types of learning the students apply in their studies. All the learning variables are Likert-scale values (from 1 to 5); points, attitude and age are integer values. The survey was completed in Finnish.
ggpairs(analysisData, lower = list(combo = wrap("facethist", bins = 20)))
In our pairs plot we can see that the correlations between the variables are quite low. The only notable correlations are between attitude and points (0.44) and between deep learning strategies and surface-level strategies (a negative correlation).

Almost all of the variables are approximately normally distributed. The only exception is age, which is explained by the fact that the participants of the survey are students, who are young and mostly of the same age.

In most cases we cannot see a clear linear pattern between two variables. Some pairs do show a pattern, such as attitude and surface learning strategies, where the values are skewed towards the negative.
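The pairwise correlations that ggpairs prints in its upper panels can also be computed directly with `cor()`. A minimal sketch on toy vectors (the values below are invented for illustration, not taken from the learning2014 data):

``` r
# toy vectors standing in for two survey variables (illustrative only)
attitude <- c(3, 4, 2, 5, 4, 3, 5, 2)
points   <- c(20, 25, 15, 30, 27, 19, 31, 14)

# Pearson correlation, the same statistic shown in the pairs plot
r <- cor(attitude, points)
round(r, 2)
```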
myRegressionModel <- lm(Points ~ Attitude + stra + surf, data = analysisData)
summary(myRegressionModel)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = analysisData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Here I created a linear regression model where the students' attitude, strategic learning techniques and surface learning techniques try to explain the students' achievement in the final exam of the course. In this first model, we can see that surface and strategic learning have no significant effect on the students' exam outcomes. Their attitude towards the study of statistics, however, proves to be very significant.
secondModel <- lm(Points ~ Attitude, data = analysisData)
summary(secondModel)
##
## Call:
## lm(formula = Points ~ Attitude, data = analysisData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In the second model, we can see that the students' attitudes have a highly significant association with the points they receive in the final exam: the p-value of about 4e-09 means that, if attitude truly had no effect, data this extreme would be exceedingly unlikely. In this model a one-point increase in attitude is associated with an increase of about 0.35 exam points. The multiple R-squared of 0.19 means that about 19 percent of the variance in exam points is explained by differences in the students' attitudes.
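The slope interpretation can be checked by hand: a fitted value is just intercept + slope × attitude. A sketch using the coefficients printed in the summary above (copied, so treat them as rounded):

``` r
# coefficients from summary(secondModel), copied from the output above
intercept <- 11.63715
slope     <- 0.35255

# predicted exam points for two attitude scores one unit apart
p1 <- intercept + slope * 30
p2 <- intercept + slope * 31

# the difference is exactly the slope: one extra attitude point
# corresponds to about 0.35 more predicted exam points
p2 - p1
```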
par(mfrow = c(2,2))
plot(x=secondModel, which= c(1,2,5))
The model assumes that the errors are normally distributed and that their variance is constant. From the residuals vs fitted plot, we can see no clear pattern in the residuals, so we can conclude that the linearity assumption holds reasonably well. The variance also seems fairly constant, as the points do not stray too far from the mean. In the normal Q-Q plot our data follows the line quite closely, which suggests the errors are normally distributed. In the residuals vs leverage plot we see that there are no real outliers with large leverage over the rest of the data. The data is a bit skewed downwards, but not significantly.
library(dplyr)
library(ggplot2)
alc <- read.table(file = "data/alc", header = TRUE)
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The student alcohol consumption data is based on the UCI data on student achievement in secondary education in two Portuguese schools. The variables include, for example, the students' sex, grades, absences and health information. The data combines two datasets: one describing student performance in mathematics and one describing performance in the Portuguese language. The combined dataset also shows the students' grades across all three years of secondary education. Alcohol consumption is measured with the variable "alc_use" and high alcohol consumption with the variable "high_use": if alc_use is greater than 2, the student is considered a heavy user of alcohol and high_use is TRUE.
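As a sketch of how these variables could be derived: Dalc and Walc in the data are the workday and weekend drinking scores, and averaging them into alc_use is my assumption about how the variable was built, not something stated in this document.

``` r
# toy rows standing in for two students (values invented for illustration)
toy <- data.frame(Dalc = c(1, 3), Walc = c(1, 4))

# alc_use as the average of workday and weekend alcohol use (assumed rule)
toy$alc_use <- (toy$Dalc + toy$Walc) / 2

# high_use is TRUE when average use exceeds 2
toy$high_use <- toy$alc_use > 2
toy
```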
For my analysis I chose sex, the mother's education, failures and absences as my indicators of alcohol consumption. I believe that a person's sex, failures in recent courses and absences from classes offer good explanatory power over their alcohol consumption. I believe drinking habits differ between boys and girls, and that misfortune and failure drive people to drink; failures in courses might also lead to absences. Together and separately, I still believe these lead to higher alcohol consumption. As my fourth variable I chose the level of the student's mother's education, based on the belief that educated parents warn their children more about alcohol and less often provide a model of heavy drinking. Higher education would therefore mean lower alcohol consumption.
str(alc$Medu)
## int [1:382] 4 1 1 4 3 4 2 4 3 3 ...
summary(alc$Medu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.806 4.000 4.000
alc$sex
## [1] F F F F F M M F M M F F M M M F F F M M M M M M F F M M M M M M M M M
## [36] F M M F F F M M M F F F M M F F F M F F F F M M F F F F F F F M F F F
## [71] M M F M F M M F M M F M M F M F F F F M M F F F F M F M F F F M M M F
## [106] M F F M M M F M F F M M M M M M M M F M F M F M F F M F F F F M F M F
## [141] M F M M F F M M F F F M M M M M F M F M M F M M M M M F F F M M M F F
## [176] M F M M M M M M F F F M M M F M F F M M M F M M F F F F F F F F F F F
## [211] F F M F M F F F M F F F F F M F F F M M F F M M M M M M F F M M M M M
## [246] M F M M M M M M M M M M M F M M F F M M F F M M F M F F F F M F F F M
## [281] M F M M M F F F F M F F M M M F F M M F F F M F M F F M M F F F F F F
## [316] F F F M M M F F F M F F F F F F F F M M M F F F M M F M M M M M M F F
## [351] F M F F F F M M F F F M F F F F F F F M M M M M F F F F F F M M
## Levels: F M
alc$failures
## [1] 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 1 0 0 0 0 0 3 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 1
## [141] 2 0 0 1 0 0 3 2 0 2 0 0 0 2 2 1 1 2 0 0 2 3 0 1 2 2 0 0 0 0 2 0 0 2 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0
## [211] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 3 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [281] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0
## [351] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 2 0 0 0
alc$absences
## [1] 5 3 8 1 2 8 0 4 0 0 1 2 1 1 0 5 8 3 9 5 0 0 1
## [24] 1 2 10 5 2 3 10 0 1 0 0 2 2 1 6 2 8 20 8 1 0 14 6
## [47] 9 3 3 2 1 1 5 0 3 5 0 6 1 2 3 3 2 1 0 2 2 2 1
## [70] 9 1 0 2 1 29 3 4 0 1 12 13 1 3 7 3 2 5 5 4 9 12 1
## [93] 5 2 1 4 3 4 1 5 1 13 0 3 21 0 10 6 3 1 7 3 5 2 9
## [116] 10 6 4 9 3 3 17 4 1 6 2 11 0 1 0 1 8 0 3 7 14 1 2
## [139] 1 2 0 3 3 6 2 3 0 11 0 1 4 3 8 0 0 6 5 3 0 1 11
## [162] 6 3 2 4 0 2 0 0 0 2 1 0 0 2 2 1 4 9 4 7 3 1 0
## [185] 44 11 9 1 0 8 5 8 0 14 4 0 4 4 12 27 0 2 5 2 20 6 21
## [208] 4 7 4 3 7 14 0 12 9 2 19 12 4 2 1 6 1 4 2 10 6 1 8
## [231] 5 7 2 7 1 10 5 2 12 2 8 11 1 4 0 2 5 6 3 14 8 0 4
## [254] 5 4 1 0 1 4 8 5 1 12 2 2 3 0 10 1 8 7 6 16 7 2 2
## [277] 4 6 45 14 11 8 4 26 18 2 2 2 8 1 2 3 3 4 6 0 3 4 2
## [300] 1 5 0 2 7 0 0 0 0 0 1 6 1 1 18 8 2 0 0 2 2 5 4
## [323] 4 4 1 4 4 0 3 12 0 6 2 3 0 8 9 2 6 6 0 0 9 8 3
## [346] 4 4 4 0 3 1 3 0 3 2 0 2 0 0 0 5 3 8 9 0 3 2 0
## [369] 14 4 2 3 0 7 1 6 2 2 2 3 4 2
alc$high_use
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [23] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [45] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
## [56] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
## [67] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [89] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [133] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [155] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [166] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [177] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## [188] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [199] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [210] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## [221] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [232] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [243] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [254] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [276] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [298] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [320] TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [331] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [342] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [353] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [364] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
## [375] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
g1 <- ggplot(data = alc, aes(x = high_use, fill = sex))
g1 + geom_bar()
g2 <- ggplot(data = alc, aes(x=failures, fill = high_use))
g2 + geom_bar()
g3 <- ggplot(data = alc, aes(x = absences, fill = high_use))
g3 + geom_bar()
g4 <- ggplot(data = alc, aes(x=Medu, fill = high_use))
g4 + geom_bar()
Above are the plots I drew of the relationships between high alcohol use and sex, failures, absences and the mother's education. The Medu variable is split into 5 levels from 0 to 4, 0 being no education at all and 4 being higher education. In the education-level plot we cannot see any strong correlation between high alcohol consumption and the student's mother's education level.

In the sex plot we can see that men tend to drink more than women. There seem to be more women than men in the dataset, yet men are more heavily represented among the heavy drinkers.

In the remaining absences and failures plots the trend is that the more absences or failures a student has, the more they tend to drink. This can be seen clearly for absences, but for failures it is hard to say anything conclusive due to the small sample sizes of the categories: most students have no failing grades.
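The same trends can be summarised numerically with dplyr's group_by/summarise. A sketch on a toy stand-in for a few rows of the alc data (values invented for illustration):

``` r
library(dplyr)

# toy stand-in for a few rows of alc (invented values)
toy <- data.frame(
  high_use = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  absences = c(2, 0, 10, 14, 4, 8),
  failures = c(0, 0, 1, 2, 0, 1)
)

# mean absences and failures within each consumption group:
# if the plots are right, both means should be higher for high users
toy %>%
  group_by(high_use) %>%
  summarise(mean_absences = mean(absences),
            mean_failures = mean(failures))
```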
m <- glm(high_use ~ failures + absences + sex + Medu, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + Medu, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1839 -0.8371 -0.6004 1.1004 2.0266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.916369 0.384473 -4.984 6.22e-07 ***
## failures 0.453431 0.199383 2.274 0.02296 *
## absences 0.093094 0.023125 4.026 5.68e-05 ***
## sexM 0.939703 0.244341 3.846 0.00012 ***
## Medu 0.005038 0.116739 0.043 0.96558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 377 degrees of freedom
## AIC: 434.4
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1471403 0.06747099 0.3057165
## failures 1.5737029 1.06754868 2.3457065
## absences 1.0975651 1.05122651 1.1511339
## sexM 2.5592205 1.59425355 4.1629212
## Medu 1.0050508 0.80004712 1.2657830
Above are the results of the logistic regression model of four variables (sex, failures, absences and the mother's education level) and their explanatory power for high alcohol use. According to the model summary, absences and sex have good explanatory power. Each additional failure increases the log-odds of being a high user of alcohol by about 0.45, i.e. multiplies the odds by about 1.57. For sex the only categories are male and female, and being male makes it considerably more likely (odds ratio about 2.56) that a student is a high user of alcohol.

At the 95% confidence level, the odds ratio for failures runs from about 1.07 at the lower end to about 2.35 at the upper end of the interval. For absences the limits are 1.05 and 1.15, and for sex (male) 1.59 and 4.16. This means that, with 95% confidence, each additional failure multiplies the odds of high alcohol consumption by somewhere between roughly 1.07 and 2.35.
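The odds ratios come from exponentiating the coefficients, which live on the log-odds scale. A minimal sketch using the failures coefficient and standard error copied from the summary above; this builds a Wald interval, which is why the numbers differ slightly from the profile-likelihood interval that confint() computes:

``` r
# log-odds coefficient and standard error for 'failures', from summary(m)
b  <- 0.453431
se <- 0.199383

# odds ratio: exp of the coefficient
exp(b)

# approximate 95% Wald confidence interval on the odds-ratio scale
exp(b + c(-1.96, 1.96) * se)
```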
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67801047 0.02356021 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.89790576 0.10209424 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801
From the predictions above we can conclude that our model misclassifies about 24% of the cases in the training data (about 26% in cross-validation). It therefore correctly predicts heavy alcohol consumption in roughly three out of four cases. The model predicts too few students to be heavy drinkers: it classifies almost 90% of the cases as non-heavy consumers of alcohol, when in fact only about 70% are. The model therefore gives too optimistic a picture of alcohol consumption among the students.
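The error rate can be read straight off the proportion table above: the off-diagonal cells are the wrong predictions. A sketch using the printed proportions (copied and rounded, so the sums are approximate):

``` r
# proportions from the 2x2 prediction table above (rounded)
true_negative  <- 0.678  # high_use FALSE, predicted FALSE
false_positive <- 0.024  # high_use FALSE, predicted TRUE
false_negative <- 0.220  # high_use TRUE,  predicted FALSE
true_positive  <- 0.079  # high_use TRUE,  predicted TRUE

# training error = share of wrong predictions; accuracy is its complement
error_rate <- false_positive + false_negative
accuracy   <- true_negative + true_positive
c(error = error_rate, accuracy = accuracy)
```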
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(ggplot2)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
?Boston
In this exercise we will be using the Housing Values in Suburbs of Boston dataset. The dataset has 506 observations and 14 variables, among them: crim = per-capita crime rate by town, zn = proportion of residential land zoned for large lots, indus = proportion of non-retail business (industry) acres per town, and black = proportion of the black population by town, plus other variables describing the population and housing situation in the Boston suburbs. Here we will focus mostly on the crime rates.
I will now use pairs and the correlation plot to describe the relationships between variables.
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Many pairs of variables correlate positively or negatively. As we can see from the correlation plot, the strongest correlation is between the index of accessibility to radial highways (rad) and the full-value property-tax rate per ten thousand dollars (tax). A strong negative correlation can be found between, for example, the proportion of owner-occupied units built prior to 1940 (age) and the weighted mean of distances to the Boston employment centres (dis). Red dots convey a negative correlation and blue dots a positive one; the bigger the circle and the darker the colour, the stronger the correlation.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, labels = c("low", "med_low", "med_high", "high"), breaks = c(bins), include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Here our goal is to standardize the data for further analysis, so we scale it (scaling subtracts the column mean from each value and divides by the column standard deviation). We then cut the crim variable at its quantiles and label the resulting intervals, so that the crime rate of each Boston suburb becomes a categorical variable. Next we drop the original crim variable from the data frame and add the new categorical crime variable in its place. Finally we split the data into a training set (a random 80% of the rows) and a test set (the remaining 20%) for fitting and evaluating our classification models.
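To make the scaling step concrete, here is a minimal sketch (a toy vector standing in for one Boston column, not the actual data) showing that `scale()` subtracts the mean and divides by the standard deviation:

```r
# toy vector standing in for one column of Boston
x <- c(2, 4, 6, 8, 10)

# scale() subtracts the mean and divides by the standard deviation
scaled_builtin <- as.vector(scale(x))
scaled_manual  <- (x - mean(x)) / sd(x)

all.equal(scaled_builtin, scaled_manual)  # TRUE
```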
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2747525 0.2277228 0.2425743 0.2549505
##
## Group means:
## zn indus chas nox rm
## low 1.0039957 -0.9267934 -0.09498221 -0.9092573 0.46606422
## med_low -0.1234836 -0.2025559 0.02723291 -0.5375839 -0.13775149
## med_high -0.3804849 0.1313309 0.12941585 0.4068228 0.04673242
## high -0.4872402 1.0170891 -0.08120770 1.0693742 -0.39463194
## age dis rad tax ptratio
## low -0.9257136 0.9285967 -0.7004450 -0.7526518 -0.44303064
## med_low -0.2598340 0.3295556 -0.5586861 -0.4608595 0.01351028
## med_high 0.3904025 -0.3562795 -0.4017782 -0.3079465 -0.22596738
## high 0.8088362 -0.8437082 1.6384176 1.5142626 0.78111358
## black lstat medv
## low 0.37345701 -0.76780926 0.56035901
## med_low 0.35797874 -0.08499805 -0.03512233
## med_high 0.07083964 0.06008625 0.10072490
## high -0.80242024 0.87625939 -0.70376633
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07734481 0.68749803 -0.70745863
## indus -0.02930265 -0.16297561 0.62118690
## chas -0.09549043 -0.05862914 0.15399036
## nox 0.36829633 -0.69151781 -1.55006461
## rm -0.13258567 -0.06090554 -0.12620716
## age 0.25392857 -0.35342625 0.02293232
## dis -0.07962346 -0.22509412 0.19045764
## rad 3.11361209 0.92763814 0.08822090
## tax 0.07267982 0.01264594 0.33262317
## ptratio 0.10508628 -0.01753996 -0.24370129
## black -0.13147599 -0.02171838 0.15605126
## lstat 0.18549279 -0.23905066 0.29828563
## medv 0.18393874 -0.32436819 -0.25703563
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9497 0.0373 0.0130
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
Here we fit a linear discriminant analysis model with the crime rate category as the target variable and all of the other variables as predictors. The proportion of trace shows that the first discriminant, LD1, captures about 95% of the between-group variance.
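The same `lda()` call works on any data with a factor target. As a self-contained illustration (using the built-in `iris` data rather than the Boston training set, so no external files are needed), the proportion of trace can be recovered from the singular values of the fit:

```r
library(MASS)

# LDA with Species as the target and all other variables as predictors
fit <- lda(Species ~ ., data = iris)

# proportion of trace = squared singular values, normalised to sum to 1
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # the first discriminant dominates, as LD1 does above
```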
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 5 1 0
## med_low 10 17 7 0
## med_high 0 8 19 1
## high 0 0 0 24
Here we cross-tabulated the predictions of our model against the correct classes of the test data. The model was very good at predicting high crime rate areas. However, it is prone to misclassifying medium-high crime rate areas as medium-low, medium-low areas as low, and low areas as medium-low. The higher the true crime rate of an area, the better the model's predictive power.
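The overall accuracy of the classifier can be computed directly from the cross-tabulation; a small sketch using the counts printed above:

```r
# confusion matrix copied from the cross-tabulation above
# (rows = correct class, columns = predicted class)
conf <- matrix(c(10,  5,  1,  0,
                 10, 17,  7,  0,
                  0,  8, 19,  1,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# proportion of test observations on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.686
```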
library(MASS)
# Scale the variables to get comparable distances
boston_scaled2 <- scale(Boston)
# k-means clustering on the scaled data
km <- kmeans(boston_scaled2, centers = 3)
# plot the scaled Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering with two clusters
km <- kmeans(boston_scaled2, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
# plot the clusters in sets of variables to keep the panels readable
pairs(boston_scaled2[, 1:5], col = km$cluster)
pairs(boston_scaled2[, 6:10], col = km$cluster)
pairs(boston_scaled2[, 11:14], col = km$cluster)
As we can see from the plot of the total within-cluster sum of squares (WCSS), the value drops dramatically when we move from one cluster to two, while adding further clusters changes it relatively little. It therefore seems most reasonable to cluster the data with k-means using two clusters.
In the plots above I paired the variables with each other using the pairs function. The pairs are drawn in sets (variables 1-5, 6-10 and 11-14, as there are only 14 variables in the data) to help with the visualization. In many of the panels the cluster boundaries are quite unclear, and the two clusters mostly separate the observations into "low" and "high" groups.
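To quantify the elbow rather than eyeball it, one can look at how much of the remaining WCSS each extra cluster removes. A sketch on the built-in, scaled iris measurements (a stand-in, since the point is the pattern rather than the Boston values):

```r
set.seed(123)
d <- scale(iris[, 1:4])

# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(d, centers = k, nstart = 10)$tot.withinss)

# share of the remaining WCSS removed by each additional cluster:
# a large first drop followed by small ones marks the elbow
round(-diff(twcss) / twcss[-length(twcss)], 2)
```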
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("tidyr", repos = "http://cran.us.r-project.org")
install.packages("stringr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("GGally", repos = "http://cran.us.r-project.org")
install.packages("corrplot", repos = "http://cran.us.r-project.org")
install.packages("FactoMineR", repos = "http://cran.us.r-project.org")
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(GGally)
library(corrplot)
human <- read.table("/Users/Lotta/IODS-project/data/human", header = TRUE)
# visualize the 'human' variables
ggpairs(human)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ eduDiff : num 1.007 0.997 0.983 0.989 0.969 ...
## $ workDiff : num 0.891 0.819 0.825 0.884 0.829 ...
## $ lifeExpectancy : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ eduExpectancy : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNICapita : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ motherMortality: int 4 6 6 5 6 7 9 28 11 8 ...
## $ teenMoms : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parliamentRep : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## eduDiff workDiff lifeExpectancy eduExpectancy
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNICapita motherMortality teenMoms parliamentRep
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
We have a dataset consisting of 155 observations and 8 variables. In the pair plots, and especially in the correlation plot, we can see a strong positive correlation between life expectancy and expected years of education. Other notable correlations are:
Maternal mortality and life expectancy (negative correlation). GNI per capita with life expectancy and expected education (positive correlation). GNI per capita with maternal mortality and the adolescent birth rate (small negative correlation).
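The 'human' data file itself is not bundled here, so as a sketch of how one might list the strongest pairwise correlations behind a corrplot, the built-in mtcars data stands in:

```r
# correlation matrix of a numeric dataset (mtcars stands in for 'human')
cm <- cor(mtcars)

# blank out the upper triangle and the diagonal so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# melt to long form and show the strongest absolute correlations
pairs_df <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(pairs_df[order(-abs(pairs_df$Freq)), ], 3)
```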
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## eduDiff workDiff lifeExpectancy eduExpectancy
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNICapita motherMortality teenMoms parliamentRep
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## eduDiff -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## workDiff 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## lifeExpectancy -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## eduExpectancy -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## GNICapita -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## motherMortality 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## teenMoms 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## parliamentRep -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## eduDiff 0.17713316 0.05773644 0.16459453
## workDiff -0.03500707 -0.22729927 -0.07304568
## lifeExpectancy -0.42242796 -0.43406432 0.62737008
## eduExpectancy -0.38606919 0.77962966 -0.05415984
## GNICapita 0.11120305 -0.13711838 -0.16961173
## motherMortality 0.17370039 0.35380306 0.72193946
## teenMoms -0.76056557 -0.06897064 -0.14335186
## parliamentRep 0.13749772 0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
When we did not scale the variables, the principal component analysis was dominated by the variables with the largest raw variances: PCA treats higher variance as higher importance, so GNI per capita, measured on a scale of thousands, swamped the other variables and the first component essentially reproduced it alone. After standardizing, every variable has variance 1 and the components reflect the joint structure of the data rather than the measurement units.
We can see that the first principal component (dimension) accounts for 53.6% of the total variance and the second for 16.2%. The analysis thus separates two strong principal components, which are uncorrelated with each other by construction.
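The scale sensitivity of PCA is easy to reproduce with toy data in which one variable (playing the role of GNI per capita) has a vastly larger variance than the other:

```r
set.seed(1)
# two independent variables on very different scales
d <- data.frame(small = rnorm(200),         # sd around 1
                large = rnorm(200) * 1000)  # sd around 1000

# unscaled PCA: the large-variance column dominates PC1 almost completely
var_unscaled <- summary(prcomp(d))$importance[2, ]

# scaled PCA: both variables contribute roughly equally
var_scaled <- summary(prcomp(d, scale. = TRUE))$importance[2, ]

round(var_unscaled, 3)
round(var_scaled, 3)
```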
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
In the first plot we can see the most common answers: the typical respondent drinks Earl Grey, without sugar, brewed from a tea bag, bought from a chain store, taken alone (with nothing added) and not at lunchtime.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
Dimension one accounts for 15.24% of the variance of the variables, and the second dimension is not much less important, covering 14.23%. People who drink their tea from a tea bag most likely buy it from a chain store, take it with sugar and tend to prefer Earl Grey. Participants who prefer unpackaged tea usually go for green tea bought in a tea shop.
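The percentage-of-variance row in the MCA summary is simply each eigenvalue divided by the sum of all eigenvalues; recomputing it from the values printed above (small differences arise because the printed eigenvalues are rounded):

```r
# eigenvalues copied from the MCA summary above
ev <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
        0.144, 0.141, 0.117, 0.087, 0.062)

# percentage of variance per dimension and its cumulative sum
pct <- 100 * ev / sum(ev)
round(pct, 2)
round(cumsum(pct), 2)
```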
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("GGally", repos = "http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
library(GGally)
BPRS <- read.table("data/BPRS")
RATS <- read.table("data/RATS")
str(BPRS)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
str(RATS)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# Draw the plot
ggplot(RATS, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight)))
### Unstandardized plot of the RATS data
RATS <- RATS %>%
group_by(Time) %>%
mutate(stdWeight = ((Weight - mean(Weight)) / sd(Weight))) %>%
ungroup()
# Plot again with the standardised Weight
ggplot(RATS, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
# Largest measurement day, used as n in the standard-error formula below
n <- max(RATS$Time)
n
## [1] 64
# Summary data with mean and standard error of Weight by group and time
# (note: sqrt(n) here uses n = 64 time points rather than the group size,
# so the standard errors are only indicative)
RATSL <- RATS %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = (sd(Weight)/ sqrt(n) )) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.…
## $ se <dbl> 1.902697, 1.636634, 1.434488, 1.700102, 1.382185, 1.472971…
# Plot the mean profiles
ggplot(RATSL, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
# Create summary data by group and ID with mean as the summary variable
# (note: filter(Time > 0) keeps the baseline day; use Time > 1 to exclude it)
RATS8S <- RATS %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATS8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
RATS$Weight
## [1] 240 225 245 260 255 260 275 245 410 405 445 555 470 535 520 510 250
## [18] 230 250 255 260 265 275 255 415 420 445 560 465 525 525 510 255 230
## [35] 250 255 255 270 260 260 425 430 450 565 475 530 530 520 260 232 255
## [52] 265 270 275 270 268 428 440 452 580 485 533 540 515 262 240 262 265
## [69] 270 275 273 270 438 448 455 590 487 535 543 530 258 240 265 268 273
## [86] 277 274 265 443 460 455 597 493 540 546 538 266 243 267 270 274 278
## [103] 276 265 442 458 451 595 493 525 538 535 266 244 267 272 273 278 271
## [120] 267 446 464 450 595 504 530 544 542 265 238 264 274 276 284 282 273
## [137] 456 475 462 612 507 543 553 550 272 247 268 273 278 279 281 274 468
## [154] 484 466 618 518 544 555 553 278 245 269 275 280 281 284 278 478 496
## [171] 472 628 525 559 548 569
# Draw a boxplot of the mean weight versus group
ggplot(RATS8S, aes(x = Group, y = mean)) +
geom_boxplot() + ggtitle("Boxplot of weights in groups") +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weekdays 1-64")
Here we can see that each of the groups has an outlier. In groups 1 and 3 the outlier drags the group mean down, while in group 2 it pulls the mean up.
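The pull a single outlier exerts on a group mean is easy to demonstrate with toy numbers (these are illustrative, not the actual rat weights):

```r
# three typical weights plus one unusually heavy animal
group <- c(460, 470, 480, 590)

mean(group)               # 500, pulled upward by the outlier
mean(group[group < 550])  # 470 once the outlier is excluded
```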
str(BPRS)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# Plot the BPRS - data
ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = subject)) + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRS)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Week is a significant predictor of the bprs score: on average the psychiatric rating drops by about 2.27 points per week of treatment. The treatment variable itself is not significant (p = 0.661), so the two treatments do not differ notably in this model.
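Using the coefficients from the summary above, the fitted mean bprs for a given week is a simple plug-in calculation:

```r
# coefficients copied from the regression summary above
b0 <- 46.4539; b_week <- -2.2704; b_trt2 <- 0.5722

# expected bprs at week 8 for treatment 1 (treatment2 dummy = 0)
b0 + b_week * 8            # about 28.3
# the same week for treatment 2
b0 + b_week * 8 + b_trt2   # about 28.9
```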
install.packages("lme4", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model (note: the grouping factor here is week;
# subject would be the more natural choice, and the near-zero week variance
# below is why the fit is singular)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | week), data = BPRS, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | week)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2839.3 2858.8 -1414.7 2829.3 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8235 -0.7281 -0.2596 0.5687 4.0804
##
## Random effects:
## Groups Name Variance Std.Dev.
## week (Intercept) 5.648e-16 2.377e-08
## Residual 1.516e+02 1.231e+01
## Number of obs: 360, groups: week, 9
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.3613 34.124
## week -2.2704 0.2513 -9.033
## treatment2 0.5722 1.2980 0.441
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.739
## treatment2 -0.477 0.000
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | week)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2839.3 2858.8 -1414.7 2829.3
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 97.897 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
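The Chisq column in the ANOVA table is a likelihood ratio statistic: the difference between the two models' deviances, compared to a chi-square distribution with degrees of freedom equal to the difference in parameter counts. Recomputing it from the printed values:

```r
# deviances and parameter counts copied from the ANOVA output above
dev_ref  <- 2829.3   # random intercept model, 5 parameters
dev_ref1 <- 2731.4   # random intercept and slope model, 7 parameters

chisq <- dev_ref - dev_ref1   # 97.9, matching the Chisq column
df    <- 7 - 5
pchisq(chisq, df, lower.tail = FALSE) < 0.001  # TRUE: the richer model fits better
```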
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + week * subject, data = BPRS, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * subject
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2717.7 2892.6 -1313.9 2627.7 315
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9828 -0.6291 -0.0213 0.5850 4.0460
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 1.650e-07 0.0004062
## week 1.663e-08 0.0001290 -1.00
## Residual 8.661e+01 9.3061824
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 44.0028 4.0742 10.800
## week 0.2333 0.8495 0.275
## treatment2 0.5722 0.9810 0.583
## subject2 3.2778 5.7199 0.573
## subject3 3.8333 5.7199 0.670
## subject4 4.0778 5.7199 0.713
## subject5 27.5778 5.7199 4.821
## subject6 -3.5778 5.7199 -0.625
## subject7 12.2444 5.7199 2.141
## subject8 -0.5778 5.7199 -0.101
## subject9 -2.9222 5.7199 -0.511
## subject10 8.2333 5.7199 1.439
## subject11 13.7556 5.7199 2.405
## subject12 3.7444 5.7199 0.655
## subject13 6.5556 5.7199 1.146
## subject14 -6.4222 5.7199 -1.123
## subject15 12.5222 5.7199 2.189
## subject16 -0.4444 5.7199 -0.078
## subject17 -4.8000 5.7199 -0.839
## subject18 -12.1444 5.7199 -2.123
## subject19 -8.6889 5.7199 -1.519
## subject20 -7.2222 5.7199 -1.263
## week:subject2 -3.6667 1.2014 -3.052
## week:subject3 -3.5417 1.2014 -2.948
## week:subject4 -2.1583 1.2014 -1.796
## week:subject5 -5.4917 1.2014 -4.571
## week:subject6 -2.3833 1.2014 -1.984
## week:subject7 -4.0333 1.2014 -3.357
## week:subject8 -0.5500 1.2014 -0.458
## week:subject9 -3.1583 1.2014 -2.629
## week:subject10 -2.3500 1.2014 -1.956
## week:subject11 -1.2167 1.2014 -1.013
## week:subject12 -2.9500 1.2014 -2.455
## week:subject13 -3.3750 1.2014 -2.809
## week:subject14 -1.7417 1.2014 -1.450
## week:subject15 -3.8667 1.2014 -3.218
## week:subject16 -3.0833 1.2014 -2.566
## week:subject17 -0.6750 1.2014 -0.562
## week:subject18 -0.9917 1.2014 -0.825
## week:subject19 -2.2583 1.2014 -1.880
## week:subject20 -2.5833 1.2014 -2.150
##
## Correlation matrix not shown by default, as p = 41 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + week * subject
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 45 2717.7 2892.6 -1313.9 2627.7 103.72 38 5.22e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# draw the plot of BPRS with the observed bprs values
ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = subject)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Observed psychiatric scale") +
theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
# Add the fitted values as a new column of BPRS
BPRS <- BPRS %>% mutate(Fitted)
# draw the plot of BPRS with the fitted bprs values
ggplot(BPRS, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(linetype = subject)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Fitted psychiatric scale values") +
theme(legend.position = "top")